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Abstract 

We present approximate inference for logistic Gaussian process density estimation in a 
grid using Laplace's method to integrate over the non-Gaussian posterior distribution of 
latent values. The flexibility of a Gaussian process prior makes it attractive for modelling 
an unknown density, while the smoothness properties of estimates can be controlled via the 
prior covariance structure. The covariance function parameters can be estimated using the 
Laplace approximated marginal likelihood and gradient-based optimization. The Laplace 
approximation is sufficiently fast for practical interactive visualisation of ID and 2D den- 
sities, and our experiments with simulated and real ID data show that the estimation 
accuracy is close to a Markov chain Monte Carlo logistic Gaussian process and state-of- 
the-art hierarchical infinite Gaussian mixture models. We also consider a reduced-rank 
approximation to speed up the computations for dense 2D grids, and demonstrate density 
regression with the proposed Laplace approach. 

Keywords: Gaussian process, logistic transformation, density estimation, approximate 
inference, Laplace's method 



1. Introduction 



In this paper, we present approximate inference for a Gaussian process (GP) density esti- 
mation using the Laplace (LA) approximation to integrate over the non-Gaussian posterior 
distribution. GP priors provide a flexible approach for density estimation in which the 
smoothness properties of estimates can be controlled via the prior covariance structures. 
Our goal is to estimate an unknown density function without restricting to any specific 
parametrized form, and therefore, we assume a logistic Gaussian process (LGP) model. A 
challenge with LGP is how to handle its intractable integration term required to ensure 



normalization. We use the same finite-dimensional approximation as described by Tokdar 



(20071) and evaluate the integral and the Gaussian process in a grid. Tokdar and Ghosh 



(2007) show conditions for the posterior consistency of the LGP density estimation, and 



bkdar (2007) shows that as the spacing between the grid points decreases, the Kullback- 



Leibler divergence from an infinite-dimensional model to the finite-dimensional approxima- 



tion converges to zero. Related results are also given by Griebel and Hegland (2010). We 



focus only on bounded intervals, but densities with unbounded intervals could be handled 



as proposed by Tokdar (2007). 
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Tokdar (2007) integrated over the latent values with the Metropolis-Hastings sampling 



and over the potential grid point sets with the Metropolis-Hastings-Green sampling. The 
purpose of the latter part is to keep the number of active grid points small and automatically 
find the most important grid point locations. By replacing the sampling with an analytic 
approximation, inference can be made faster even if a fixed and finer grid is assumed, which 
is why we consider Laplace's method to approximate posterior inference for LGP density 
estimation. 

The proposed LA method is related to other approaches in the literature. Given fixed co- 
variance function parameters, the LA approximation for the posterior distribution involves 
finding the posterior mode, which is equivalent to a penalized maximum likelihood estima- 
tor considered by Leonard (1978) and Thorburn (1986). The marginal likelihood can be 



approximated with Laplace's method, which enables fast gradient-based type-II maximum 
a posteriori (MAP) estimation of the covariance function parameters, or alternatively, the 



posterior distribution can be integrated over. Lenk ( 1991 ) uses a truncated Karhunen-Loeve 



representation and derives the moments of the density process, from which it is also pos- 
sible to obtain marginal likelihood approximation. Lenk evaluates the marginal likelihood 
of hyperparameter values in a grid, while we use gradients and the BFGS quasi-Newton 
optimization or Markov chain Monte Carlo (MCMC) integration. In Lenk's approach the 
mean of the density process is guaranteed to be positive, but there is no such a guarantee 
for the whole posterior. Also the spectral representation restricts the selection of covariance 



functions. Lenk (2003) refines his approach with a Fourier series expansion and MCMC 



sampling of the hyperparameters. 

The computational complexity of the proposed LA approach is dominated by the covari- 
ance matrix operations, which scale in straightforward implementation as 0(m 3 ), where m 
is the number of grid points. However, because of the discretization, the computational com- 
plexity is independent from the number of observations. Applying the proposed approach 
with the default grid size 400, ID and 2D density estimation takes about two seconds, 
which facilitates interactive visualization of data with density estimates and violin plots 
(see, e.g., Figures [3]-[6] and [8]) . Additionally, we consider the fast Fourier transform (FFT) 
to speed up the computations with an even grid and stationary covariance functions. To 
avoid the cubic computational scaling in m with dense 2D grids, we also exploit Kronecker 
product computations to obtain a reduced-rank approximation of the exact prior covariance 
structure. 

The number of grid points grows exponentially with the data dimension d in the pro- 
posed approach. Although the Kronecker product approach is suitable for reducing the 
computation time with a larger d, the exponentially increasing number of latent values 
makes the proposed approach impractical for more than three or four dimensions. |Adams| 
et al. (2009) propose an alternative GP approach called Gaussian process density sampler 



(GPDS) in which the numerical approximation of the normalizing term in the likelihood 
is avoided by a conditioning set and an elaborated rejection sampling method. The condi- 
tioning set is generated by the algorithm, which automatically place m s points where they 
are needed, making the estimation in higher-dimensional spaces easier. However, the com- 
putational complexity of GPDS scales as 0((n + m s ) 3 ), which depends also on the number 
of data points n. 
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In the next section, we review the basics of the logistic Gaussian process and then present 
the Laplace approach in Section |3j In Section [4j we demonstrate the LA approach and 
compare it to MCMC and the logistic Gaussian process to a hierarchical infinite Gaussian 
mixture models by Griffin ( 2010[ ). 



2. Density estimation with logistic Gaussian process priors 

We consider the problem of computing a density estimate p(x) given n independently drawn 
(i-dimensional data points xj, . . . ,x n from an unknown distribution in a finite region V of 
In this paper, we focus only on d = {1,2}. To find an estimate p for the unknown 
density, we can maximize the following log-likelihood functional 



L(p) = ^logp(xj 



(1) 



i=i 



with the constraints J v p(x.)dx = 1, and p(x) > for x£ V. The limiting solution leads to 



a mixture of delta functions located at the observations (Leonard, 1978), which is why we 



need to set prior beliefs about the unknown density to obtain more realistic estimates. 

To introduce the constraints of the density being non-negative and that its integral over 
V is equal to one, we employ the logistic density transform 



p(x) 



exp(/(x)) 



(2) 



(Leonard, 1978), where / is an unconstrained latent function. To smooth the density esti- 



mates, we place a Gaussian process prior for /, which enables us to set the prior assumptions 
about the smoothness properties of the unknown density p via the covariance structure of 
the GP prior. 

We assume a zero-mean Gaussian process g(x) ~ QV (0, k(x, x')), where the covariance 
function is denoted with k(x, x') for a pair of inputs x and x'. An example of a widely used 
covariance function is the stationary squared exponential covariance function 



k(x, x') 



a 2 exp 



1 d 



k=l 




(3) 



where hyperparameters 9 = {a , h, . . . , Id} govern the smoothness properties of / (Ras- 



mussen and Williams, 2006). The length-scale hyperparameters 1%, . . . , Id control how fast 
the correlation decreases in different dimensions, and a 2 is a magnitude hyperparameter. 

For the latent function / in equation Q, we assume the model /(x) = <?(x) + h(x) T /3, 
where the GP prior <?(x) is combined with the explicit basis functions h(x). Regression 
coefficients are denoted with f3, and by placing a Gaussian prior (5 ~ A/"(b, B) with mean 
b and covariance B, the parameters (3 can be integrated out from the model, which results 
in the following GP prior for /: 



/(x) ~ QV (h(x) T b, k(x,x') + h(x) T 5h(x')) 



(4) 



(O'Hagan, 1978 Rasmussen and Williams, 2006). For the explicit basis functions, we use 



the second-order polynomials h(x) = [x\ 



x d x 



2]T 



which leads to a GP prior that 
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without basis f. without basis f. with basis f. with basis f. 




95% CI 
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-5 5 -5 5 -5 5 -5 5 



X X X X 

Figure 1: An illustration of the logistic Gaussian process density estimation with and with- 
out the basis functions. The first plot visualizes the posterior latent function 
without the basis functions, and the second plot shows the corresponding density 
estimate with the logistic density transform ([2]) given 50 observations from the 
mixture of two Gaussians: ^Af(—2, 1) + \N{2, 2 2 ). The third plot visualizes the 
posterior latent function with the second-order polynomials as the basis functions, 
and the fourth plot shows the corresponding density estimate. The hyperparam- 
eter values of the squared exponential covariance function ^ are in this example 
a 2 = 1 and l\ = ^. 



can favour density estimates where the tails of the distribution go eventually to zero. The 
effect of the basis functions is demonstrated in an illustrative example of Figure [T] and later 
in Section [4.11 

We discretize V into m subregions (or intervals in ID), and collect the coordinates of the 
subregions into an m x d matrix X, where the i'th row denotes the center point of the i'th 
subregion. Given X, the GP prior Q results in the Gaussian distribution over the latent 
variables 

p(f\X, 6) = J\f(i\Hb, K + HBH T ), (5) 

where f is a column vector of m latent values associated with each subregion. The entries 
of the mxm covariance matrix K are determined by the input points X and the covariance 
function. The matrix H of size m x 2d contains the values of the fixed basis functions 
evaluated at X. We assume a somewhat weakly informative prior distribution for the 
regression coefficients by fixing the mean b = which is a 2d x 1 vector of zeros, and the 
covariance B = W 2 l2d, where l2d is the identity matrix of size 2d x 2d. The regression 
coefficient for the quadratic term xi should be negative in order to make the tails of a 
distribution to go towards zero. However, the prior distribution does not force the negativity 
of the coefficient. In our experiments, the posterior of the coefficient was clearly below zero, 
but it would also be possible to force the negativity of the coefficient by rejection sampling. 

After the discretization, the log-likelihood contribution of an observation belonging to 
the i'th subregion can be written as 

/ Wiexp(fi) . 
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where the latent /j is associated with the i'th subregion. Throughout this paper, we assume 
a regular grid, and therefore the weights wi, ... ,w m have all the same value and can be 
omitted from ([6]). The number of observations that fall within the i'th subregion is denoted 
with iji and all the count observations with an m x 1 vector y. The overall log- likelihood 
contribution of the n observations is given by 

logp(y|f) = y T f - nlog ^fjexp(/ J ) j . (7) 

The prior §5§ and the likelihood ([7]) are combined by Bayes' rule, which results in the 
conditional posterior distribution of the latent variables 

p(f\X,y,e) = ^p(i\X,e)p(y\f), (8) 

where Z = p(y\X,0) = J p(f \X, 9)p(y\f )df is the marginal likelihood. Due to the non- 
Gaussian likelihood ([7]), the posterior distribution is also non-Gaussian, and approximate 
methods are needed to integrate over f . 

3. Approximate inference 

In this section, we discuss the implementation issues of Laplace's method for logistic GP 



density estimation. In Section 3.1, we show how the mode finding can be done efficiently 



with the fast Fourier transform and present a reduced-rank approximation suitable for d > 1 



cases with dense grids (large m). In Section 3.2, we consider the LA approach for logistic 



GP density regression, and in Section 3.3, we give a brief description of inference with 
MCMC. 

3.1 Inference with the Laplace approximation 

Our approach is motivated by the Laplace approximation in GP classification (|Williams and 



Barber, 1998 Rasmussen and Williams, 2006]), but the implementation differs from classi- 



fication because each factorized term in the likelihood ([7]) depends on all the latent values 
f. The implementation resembles the LA approximation for the point process intensity 



estimation with a GP prior as presented by Cunningham et al. (2008). 



The Laplace approximation is based on a second-order Taylor expansion for logp(f \X, y, 9 
around the posterior mode f , which results in the Gaussian approximation 

q{f\X,y,G)=N{f\f,V), (9) 
where f = argmaxf p(f \X, y, 9). The covariance matrix is given by 

z = (c- 1 + w)-\ (10) 

where C = K + HBH T and W = -VVf logp(y|f)| f=f . The likelihood Q leads to a full 
matrix W with the following structure 

iy = n(diag(u)-uu T )[] (11) 

1. We use the following notation: diag(w) with a vector argument means a diagonal matrix with the 
elements of the vector w on its diagonal, and diag(W) with a matrix argument means a column vector 
consisting of the diagonal elements of the matrix W . 
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where the non-negative entries of the vector u are given by 

_ exp(/j) 

E J= i ex p(/i) 



(12) 



In the implementation, forming the full matrix W can be avoided by using the vector u 

Similarly to multiclass classification with the softmax likelihood 
1998 Rasmussen and Williams, 2006), W is positive semidefinite, 



and the structure (11 



(Williams and Barber 



and because C is positive definite, p{f\X,y,Q) has a unique maximum. 
3.1.1 Newton's method for finding the mode 

We use Newton's method for finding the mode f . At each iteration, we need to compute 

f ncw = (£-1 + W r)-l Vj ( 13 ) 

where v = Wf + Vf logp(y|f). To increase the numerical stability of the computations, we 
use the factorization W = RR T ', where 



R = \/^((diag(u))2 — uu T (diag(u)) 2). 



(14) 



Instead of a direct implementation of 
e.g. 



131), we can apply the matrix inversion lemma (see, 



Harville, 1997) to write the Newton step in a numerically more preferable way as 
f new = C(I m - R(I m + R T CR) 



1 R T C)v. 



(15) 



The inversion of equation (15) can be computed by solving z from the linear system 
{I m + R T CR)z = R T Cv with the conjugate gradient method. As discussed, for instance, by 



Cunningham et al. (2008), a stationary covariance function and evenly spaced grid points 



lead to a Toeplitz covariance matrix which can be embedded in a larger circulant matrix 
enabling efficient computations by using the fast Fourier transform. With a Toeplitz co- 



variance matrix K, we can achieve a speed-up in the evaluation of the Newton step (15) 



because all the multiplications of K and any vector can be done efficiently in the frequency 
domain. By using FFT, these matrix-vector multiplications become convolution operations 
for which we are required only to form a single row of the embedded circulant matrix instead 



of the full matrix K. The rest of the matrix-vector multiplications in (15) are fast because 



the matrix H has only two (in ID) or four (in 2D) columns, and instead of forming the full 



matrix R, we can use the vector u and exploit the structure of R from equation (14). 



3.1.2 The approximate marginal likelihood and predictions 

Approximating the integration over f with the LA approximation enables fast gradient- 
based type-II MAP estimation for choosing the values for the covariance function hyper- 
parameters. After finding f, the approximate log marginal likelihood can be evaluated 
as 

1 



logp(y|X,0)«logg(y|X,0) 



■if T C- 1 f + logp(y|f) 



log \I m + R 1 CR\ 



(16) 



(see, e.g., Rasmussen and Williams, 2006). The first and second terms of (16) are fast to 
compute by using the results from Newton's algorithm, but the determinant term is more 
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difficult to evaluate. As shown by Cunningham et al. (2008) for the point process intensity 



estimation with the GP priors, the evaluation of a corresponding determinant term can 
be done efficiently by exploiting a low-rank structure of the observation model. In density 
estimation W has rank m — 1 due to the normalization over V, and a similar low-rank 
representation as in intensity estimation cannot be done for W. Therefore, we form the full 
mxm matrix and compute the Cholesky decomposition that scales as C(m 3 ) . Although this 
is computationally burdensome for large m, it is required only once per each evaluation of 
the marginal likelihood after the convergence of Newton's algorithm. Typically, in ID cases 
with the number of intervals m being less than one thousand, the LA approach is sufficiently 
fast for practical purposes (in our implementation the default size for m is 400). However, 
the computations can become restrictive in 2D with large m. In such cases, we consider a 
reduced-rank approximation for the prior covariance K to find a faster way to compute the 
determinant term and the approximate posterior covariance, as will be discussed in Section 



3.1.3 The gradients of the log marginal likelihood with respect to the hyperparameters 
9 can be computed by evaluating explicit and implicit derivatives similarly as shown by 
Rasmussen and Williams (2006). 



We place a prior distribution p{9) for the hyperparameters to improve the identifiability 
of the ratio of covariance function magnitude and length-scale parameters. We assume a 
weakly informative half Student-t distribution, as recommended for hierarchical models by 



Gelman ( 2006 ) , with four degrees of freedom and a variance that is equal to ten for log a 2 
(log magnitude), and for h, . . . , Id (length-scales), we assume otherwise the same prior but 
with a variance that is equal to one. The MAP estimate for the hyperparameters is found 
by maximizing the approximate marginal posterior p(8\y,X) oc q(y\X , 8)p(6) , in which we 
use the BFGS method. 

To compute the joint posterior predictive distribution, we marginalize over the latent 
values by Monte Carlo using 1000 latent samples drawn from the multivariate normal poste- 
rior predictive distribution. This sampling is needed only once after the MAP estimate for 
the hyperparameters has been found, and time consumed in this step is somewhat negligible 
compared to the other parts of the computations. The negativity of the coefficients for the 
quadratic term x|, required to force the tails of the density estimate to go towards zero, 
could easily be enforced by using rejection sampling in this phase, but it was unnecessary 
in our experiments. 

3.1.3 The reduced-rank approximation for 2D density estimation 

To speed up the inference with 2D grids when m is large, we propose a reduced-rank 
approximation that can be formed efficiently by exploiting Kronecker product computations. 
For separable covariance functions, the covariance structure of evenly spaced grid points 
can be presented in a Kronecker product form K = K\ (g) K2, where K\~ is an x 
matrix representing the covariances between the mk grid points in the fe'th dimension. For 
Kronecker products, many matrix operations scale efficiently: for example, the determinant 
\K\ (g) K.2\ can be computed by using the determinants of the smaller matrices K\ and 



K2 as |i^i| m2 |i^2 mi (Harville, 1997). To compute the approximate marginal likelihood 



(16), we need to evaluate the determinant term of a form \I m + R T (K\ <g> K2 + HBH T )R\. 



Unfortunately, the Kronecker product covariance structure does not preserve due to the 
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multiplication and summation operations, and this leads to the unfavourable 0(m 3 ) scaling. 
However, we can exploit the Kronecker product K\ ® K2 to obtain the eigendecomposition 
of K efficiently, and then form a reduced-rank approximation for K by using only the largest 
eigenvalues with the corresponding eigenvectors. The idea of using the eigendecomposition 
to construct a reduced-rank approximation for the covariance matrix has been previously 
mentioned by Rasmussen and Williams (2006 Chapter 8). By denoting an eigenvalue of 
K\ with r*i and an eigenvector of K\ with vi, and similarly, an eigenvalue of K2 with T2 
and an eigenvector of K2 with V2, then, r\T2 is an eigenvalue of K\ <g> K2 and vi (g) V2 an 



eigenvector of K 1 (g) K 2 corresponding to the eigenvalue r\T2 (Harville, 1997). Thus, instead 



of computing the eigendecomposition of K, which is an 0(m ) operation, we can compute 
the eigendecompositions of K\ and K2, which scales as 0{m\ + to|), and form a desired 
number of eigenvalues and eigenvectors with the Kronecker product computations to obtain 
the reduced-rank approximation for K. 

We approximate the exact prior covariance with 

VSV T 



K 



+ A, 



(17) 



where S is a diagonal matrix of size s x s having the s largest eigenvalues on its diagonal 
and V is an to x s matrix consisting of the corresponding eigenvectors. Similarly to the fully 
independent conditional (FIC) approximation (Snelson and Ghahramani [2006 ) , we use the 
exact full-rank diagonal by setting A = diag(diag(K) — &\a,g{V SV 1 )) in (17) to obtain a 



more accurate approximation. This does not cause trouble in the implementation since A 
is diagonal and non-negative. 

With the approximate prior (17), the mode can be found using Newton's method in 
a similar way as described in Section |3.1.1 To evaluate the marginal likelihood approx- 
imation, we need to compute efficiently the determinant in (16). The determinant term 
can be written as n\ A\\l T D 1 / 2 A- l D 1 ' 2 l\, where 1 is an to x 1 column vector of ones, 
D = n(diag(u)) is a diagonal matrix, and A = I m + D 1 / 2 (A + VSV T )D 1 / 2 . We have de- 

S 0" 

_ _ of size (s + 4) x (s + 4). To 

D 

avoid forming any to x m matrix, |^4| can be evaluated by applying the matrix determinant 
lemma 



fined V = [V H\ , which is of size to x (s + 4), and S 



(see, e.g. 



Harville, 1997) and A by applying the matrix inversion lemma. With 



the prior (17), we can also compute the approximate gradients of the marginal likelihood 
with respect to hyperparameters without resorting to any 0(to 3 ) matrix operations. 

For large to, we use the same fixed grid for the observations and predictions. After 
we have found the MAP estimate for the hyperparameters, we need to draw samples from 
q(f \X, y, 6) to marginalize over the latent values. The posterior covariance is approximated 
by 

E « C -C(C + W~ 1 )' 1 C, (18) 

where C = A + VSV T is the approximate prior covariance with the explicit basis functions 
evaluated at the grid points. We can rewrite (C + W~ l ) 

El(l T Eiy 



in (18) as 



(C + W 



-lN-l 



E 



-H T E, 



(19) 



where 



E = B X I\Z - ZD l / 2 V~S 1 / 2 (LL T )- 1 ~S 1 ' 2 V T D 1 ' 2 Z)D 1 



/2 



(20) 
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In (20) we have denoted Z = {I m + DA) which is diagonal and therefore fast to evaluate. 



The matrix L in equation ( |20[ ) is an (s + 4) x (s + 4) lower triangular matrix, and it can be 
computed using the Cholesky decomposition 

§1/2^1/2^1/3^1/2^ 



L = chol(/ s+4 + 



(21) 



which scales as 0((s + 4) 3 ). In the rest of the computations when drawing samples from 



the Gaussian approximation q(f\X,y,9) with the covariance matrix (18), the structure of 
C can be exploited to avoid forming any m x m matrix. 

With the reduced-rank approximation based on the eigendecomposition, we avoid choos- 
ing (or optimizing) the locations of inducing inputs, which is required in many sparse ap- 
proximations ( Quihonero-Candela and Rasmussen, 2005). However, due to the Kronecker 



product form, the covariance function must be separable with respect to the inputs, which 
leads to a restricted class of possible covariance functions. In this paper, we have tested the 
reduced-rank approximation with the squared exponential covariance function. 



3.2 Density regression with the Laplace approximation 

Logistic Gaussian processes are also suitable for estimating conditional densities p(t\x), 



where t is a response variable (see, e.g., Tokdar et al. 2010). We discretize both input and 



target space in a finite region to model the conditional densities with the logistic GP. In 
this paper, we focus on modelling the conditional density of t given a univariate predictor 
x, which leads to a 2D grid similarly as in the case of 2D density estimation. We denote 
the number of intervals in input space with m x and the number of intervals in target space 
with mt. To estimate the conditional densities, we write the log-likelihood contribution of 
all the observations as 



logp(y|f) = yffi ~ n i l °S /^2 eic P(fi,. 



J =1 




(22) 



where fa a is the j'th element of fj. The mt x 1 vector fj contains all the latent values asso- 
ciated with each subregion conditioned to the input Xj, that is, the latent values associated 
with the i'th slice of the grid. Similarly, y i is a vector of length mt and consists of the 
number of observations in each subregion of the i'th slice of the grid given x%. The vector f 
contains all the latent values of the grid and y all the count observations. To approximate 
the resulting non-Gaussian posterior distribution, we use again Laplace's method. The like- 



3.1.1 



lihood ( 22[) re sults in that W in equation (|10|) is a block-diagonal matrix. Similarly as in 
Section 



the i'th block of W can be factorized into RiRj ', where 



Ri = y/rTi((di&g(ui))2 - UjUj (diag(uj)) 2) 



(23) 

The total number of observations in the i'th slice of the grid is denoted with rtj, and the 
vector Uj is formed as in equation (12), but by considering only the latent variables fj. 



Because the LA approach for the LGP density regression follows closely to the derivation 



presented in Section 3.1 the implementation issues are omitted here. The density regression 
becomes computationally challenging with dense grids and when applied to a larger number 
of input dimensions d, and therefore, computational speed-ups are required, but we do not 
consider these in this paper. 
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3.3 Markov chain Monte Carlo 



MCMC sampling enables approximating the integration over the posterior distribution with- 
out limiting to a Gaussian form approximation. MCMC can be extremely slow but in the 
limit of a long run it provides an exact result by which the accuracy of the LA approach 
can be measured. 

We approximate the posterior distribution by sampling alternatively from the condi- 
tional posterior of the latent values p(f\X, y, 9) by using scaled Metropolis-Hastings sam- 



pling (Neal, 1998), and from the conditional posterior of the covariance function parameters 



p(9\f) by using Hamiltonian Monte Carlo (Duane et al. 1987 Neal, 1996). The predictive 



distribution is estimated by sampling from the approximate mixture of multivariate Gaus- 
sians. 



4. Experiments 



In this section, we examine the performance of the Laplace approximation for the logistic 
Gaussian process (Laplace-LGP) with several simulated and real datasets. We compare 
Laplace-LGP to the MCMC approximation (MCMC-LGP) and to Griffin's ( |2010[ ) hierar- 
chical infinite Gaussian mixture models with Common Component Variance (CCV-mixture) 
and Different Component Variance (DCV-mixture). We compare Laplace-LGP only to ad- 
vanced Bayesian kernel methods (Griffin, 2010), since |Tokdar| ( [2007l ) and |Adams|fl2009D have 
already shown that logistic GP works better than simple kernel methods (such as the Parzen 
method). Griffin showed that CCV- and DCV-mixture models performed equally or better 
than other default priors for mixture models, which is why the other priors are excluded in 
our comparisons. Laplace-LGP and MCMC-LGP were implemented using the MATLAB 
toolbox GPstuff] ( jVanhatalo et al.[ |2012[ ), and CCV/DCV-mixtures were computed using 
Griffin's codeJ3 

The squared exponential covariance function was employed in all the experiments with 
LGP. We also tested Matern, exponential, rational quadratic and additive combinations, 
but the results did not improve considerably with these different covariance functions. To 
ensure that the same prior is suitable for different scales, we normalized the grid to have 
zero-mean and unit-variance in all the experiments. The convergence of MCMC chains 



was diagnosed with visual inspection and the potential scale reduction factor (Brooks and 



Gelman, 1998). The sufficient length of MCMC chains were determined by estimating the 



effective sample size by Geyer's initial monotone sequence estimator (Geyer, 1992). 



The rest of the section is divided into six parts. First in Section 4.1, we show the effect 



of the basis functions on density estimates. We compare the performances of Laplace-LGP, 



MCMC-LGP and CCV/DCV-mixtures with simulated ID data in Section [42j and with 
real ID data in Section [4.3| In Section 4.4 we test how different grid sizes and the number 



of data points affect density estimates. We illustrate density estimation with simulated 2D 



data in Section 4.5, and finally, we demonstrate density regression with one simulated data 
in Section I3~6l 



2 . http : / /bees . aalto.fi/en / research /bayes / gpstuff / 

3. http://www.kent.ac.uk/ims/personal/jeg28/BDEcode.zip 
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is f. 



Figure 2: The effect of basis functions for the Gaussian and Student^ distributions with 50 
or 100 observations. The solid line is the true log-density, the dashed and dash- 
dotted lines are the mean estimates without and with the explicit basis functions. 
The explicit basis functions provide additional prior information which effect 
shows especially in the tails with only few or no observations. 
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A 



Gamma Gamma + Gaussian 
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Figure 3: Example results of density estimation and related uncertainties for four different 
simulated datasets with n = 100. 



4.1 The effect of basis functions 

Figure [2] shows how the explicit basis functions provide additional prior information for the 
posterior mean density estimates given n = 50 or n = 100 observations from the normal 
distribution A/"(0, 1), or from the Student-t4(0, 1) distribution. Without the explicit basis 
functions the latent function goes towards the prior mean, that is the mean log density, in 
the tails where there are only few or no observations. If the explicit basis functions match 
the true log density (the first and second plots) or if they do not match but n is small (the 
third plot), tail log densities are estimated mainly using the basis functions. If the explicit 
basis functions do not match the true density and n is large, basis functions get smaller 
weight and the tail areas are estimated using the nonparametric part (the fourth plot). 

4.2 Simulated ID data 

Figure [3] shows the density estimates and the corresponding 95% credible intervals for single 
random realisations from simulated datasets with n = 100. The simulated datasets are: 

• t^. Student^ (0, 1) 
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1) MCMC-LGP 

2) CCV-mixture 

3) DCV-mixture 



123 123 123 123 

Figure 4: The pairwise comparison of Laplace-LGP to MCMC-LGP, CCV-mixture and 
Mixture-DCV. Plot shows the distribution of differences in the mean log- 
predictive density (MLPD) along with the median and 95% lines. Values above 
zero indicate that a method is performing better than Laplace-LGP. MLPDs were 
computed using the true density as the reference. Violin plots were produced us- 
ing Laplace-LGP from the results of 100 independent random repetitions. 



• Mixture of two i 4 : §i 4 (0, 1) + jt 4 (3, ^) 

• Gamma: Gam(l, g) 

• Gamma+Gaussian: | Gam(l, ^) + p-). 

Student's i 4 -distribution was chosen as a simple unimodal distribution with thicker tails 
than Gaussian. The mixture of two ^-distributions was chosen as a more challenging case, 
where there are two separate modes with different widths. In the second plot of Figure |3j 
it can be seen that the short length-scale required to model the narrow peak, makes the 
density estimate bumpy also elsewhere. Gamma was chosen as a simple distribution with a 
mode on the boundary, and Gamma+Gaussian as a more difficult case with one mode on 
the boundary and another mode in the middle of the distribution. Gamma+Gaussian was 



also used by Tokdar (2007) and Adams (2009). 



Figure [4] shows the pairwise comparison of Laplace-LGP to MCMC-LGP, CCV-mixture 
and DCV-mixture. We made 100 realisations of n = 100 random samples from the true 
density and computed for each method the mean log-predictive densities (MLPD) over 
the true distribution. Distributions of the differences between MLPDs are plotted with 
violin plots generated using Laplace-LGP (grid size 200) along with the median and 95% 
lines. There are no statistically significant differences between the methods for the first two 
datasets. For the last two datasets there is no statistically significant difference between 
Laplace-LGP and MCMC-LGP, but the Gaussian mixture models do not work well for the 
data with the mode on the boundary. The violin plots in Figures [4] and [6] also illustrate 
another practical application of density estimation with the LA approach which facilitates 
interactive visualisation. 

4.3 Real ID data 

Figure [5] shows the density estimates and the corresponding 95% credible intervals for the 
following real datasets: 
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Galaxy 



Enzyme 



Log acidity 



Sodium lithium 




95% CI 
1 Mean 



Figure 5: Example results of density estimation and related uncertainties for four different 
real datasets with n = {82, 245, 155, 190}. 



• Galaxy n = 82 

• Enzyme n = 245 

• Log acidity n = 155 



Sodium lithium n = 190, 



which were studied by Griffin (2010). The density estimates are visually similar to the 



results by |Griffin (2010, Figure 5). 

Figure [6] shows the pairwise comparison of Laplace-LGP to MCMC-LGP, CCV-mixture 
and DCV-mixture with the real datasets. For each method, we computed leave-one-out 
cross-validation mean log-predictive densities (CV-MLPD). Samples from the distributions 



of the differences between CV-MLPDs were obtained using the Bayesian bootstrap (Rubin 



1981, Vehtari and Lampinen, 2002) and violin plots were generated using Laplace-LGP. 



There is no substantial difference in the performances of Laplace-LGP and MCMC-LGP 
across the datasets. CCV-mixture and DCV-mixture perform better than Laplace-LGP for 
the Galaxy and Log acidity datasets, whereas Laplace-LGP performs slightly better than 
the mixture models for the Enzyme and Sodium lithium datasets. 



4.4 The effect of the number of data and grid points 

Figure [7] illustrates the effect of the number of grid and data points to the accuracy of 
Laplace-LGP and MCMC-LGP. The number of grid points was 400 when the number of 
data points was varied, and the number of data points was 100 when the number of grid 
points was varied. For each combination, we made 100 realisations of random samples 
from the Mixture of two t± and Gamma+Gaussian distributions. For both datasets KL 
divergence approaches to zero when the number of data points increase. It seems that 
about 50-100 grid points are sufficient for the Mixture of two t$ distribution, whereas the 
Gamma+Gaussian data is less sensitive to the number of grid points. There is no practical 
difference in the performances of Laplace-LGP and MCMC-LGP. 

We measured computation times for density estimation with different grid sizes using 
100 random samples from the Student^ distribution. Laplace-LGP with the grid sizes 
(50, 100, 200, 400, 900) took about (0.1, 0.2, 0.3, 0.9, 4.5) seconds using four cores of Intel(R) 
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Galaxy Enzyme Log acidity Sodium lithium 




1) MCMC-LGP 

2) CCV-mixture 

3) DCV-mixture 



Figure 6: The pairwise comparison of Laplace-LGP to MCMC-LGP, Mixture-CCV and 
Mixture-DCV. Plot shows the distribution of differences in the cross-validation 
mean log-predictive density (CV-MLPD) along with the median and 95% lines. 
Values above zero indicate that a method is performing better than Laplace-LGP. 
CV-MLPDs were computed using leave-one-out cross-validation. Violin plots 
were produced with Laplace-LGP given 1000 Bayesian bootstrap draws from the 
distribution of CV-MLPD. 




— e — Laplace 
- * - MCMC 



200 400 200 400 200 400 200 400 



# data points # grid points # data points # grid points 

Figure 7: The effect of the number of data and grid points to the accuracy of density 
estimation with two simulated datasets. The number of grid points was 400 
when the number of data points was varied, and the number of data points was 
100 when the number of grid points was varied. For both datasets, as the number 
of data points increases, the KL divergence decreases to near zero. About 50-100 
grid points seem to be sufficient for the Mixture of two t4-distributions, whereas 
the Gamma+Gaussian distribution is less sensitive to the number of grid points. 
There is no practical difference in the performances of Laplace-LGP and MCMC- 
LGP. 



Xeon(R) 2.67GHz. If the matrix-vector multiplications in Newton's algorithm were made 
with FFT, the times were about (0.3, 0.3, 0.4, 0.8, 3.4) seconds. MCMC-LGP with the same 
grid sizes took approximately (23,32,61,180,900) seconds. Using Griffin's code with the 
default options, CCV-mixture took about 200 seconds and DCV-mixture about 800 seconds. 
Note that these time comparisons are only approximate, and the computation times depend 
considerably on the specific implementation and on the chosen convergence criteria. 
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Figure 8: Example results of 2D density estimation for four different simulated datasets. 

The upper row shows the contour plots of the true densities and the lower row 
shows the contour plots of the estimated densities with n = 100 observations. 



4.5 Simulated 2D data 

The upper row of Figure [8] shows the following four 2D distributions: 



Student-ts: *s(0, 



1 .7 
.7 1 



) 



• Mixture of two Gaussians: ^jV 



1 
1 



Banana: x\ ~ JV(0, 10 2 ), x 2 ~ N(^x\ - |, 1) 
'3 \cosv~] f0.2 2 



J —IT 



COS if 

sin (p 



0.2 2 
0.2 2 



dip. 



Given a random sample with n = 100 from each distribution, we compute the 2D density 
estimates with Laplace-LGP. The mean estimates are illustrated in the lower row of Figure 
[H| Density estimation for each plot with a 20 x 20 grid took about two seconds. 

We tested the reduced-rank approximation with two 2D datasets. Given n = 100 ob- 
servations from the Student-is and Mixture of two Gaussians distributions, we measured 
computation times and the KL divergences from the true distribution to the estimated ones 
with different grid sizes. Figure [9] shows the elapsed times and the KL divergences as a 
function of the grid sizes for Laplace-LGP with the exact prior covariance (Full) and with 



the reduced-rank prior covariance (Kron) of equation (17). We formed the reduced-rank 
approximation by excluding all the eigenvalues smaller than 10 -6 , or taking at most 50% of 
all the eigenvalues. The differences between the exact and the reduced-rank prior are small 
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20 30 40 50 20 30 40 50 20 30 40 50 20 30 40 50 
# grid points (sqrt) # grid points (sqrt) # grid points (sqrt) # grid points (sqrt) 



Figure 9: The comparison of Laplace-LGP with the full prior covariance (Full) and with the 
reduced-rank approximation (Kron.) of equation (17) for two 2D datasets. The 
performances in the KL sense are similar, but for grid sizes larger than 30 x 30, 
the computation times are more expensive with the full prior covariance matrix. 



in the KL sense, but for the grid sizes larger than 30 x 30, Laplace-LGP with the full prior 
covariance matrix becomes computationally more expensive. 



4.6 Density regression with simulated data 

The estimation of a conditional density in a grid with the LA approach is essentially similar 
to ID density modelling with Laplace-LGP. Therefore, in this paper, we consider density 
regression with only one simulated dataset. 



We demonstrate density regression with a simulated case studied by (Kundu and Dun- 



son 



2011). The first plot of Figure |10| shows the simulated density regression scheme, 
where the predictors z% (i = 1, . . . , 50) are generated from the following trimodal density: 
^Af(-§, (|) 2 ) + ^AA(|, (§ ) 2 ) + jqM(0, (\) 2 ). The generating model for a univariate re- 
sponse is ti = Aexp ^— j^Jz. j + 1 ^ z g ' z . €i, where ej ~ AA(0, of). We fixed A = 3 and a e = 1. 
The second and third plots of Figure [10] show estimates with the Laplace approximation 
(Laplace-DR) and the MCMC sampling (MCMC-DR) given a single realisation of n = 50 
samples. Density regression in a 20 x 20 grid with Laplace-DR took about three seconds and 
with MCMC 220 seconds. Finally, to show the differences between the estimates obtained 
with Laplace-DR and Laplace-LGP, we illustrate a conditional density estimate computed 
directly from a 2D density estimate obtained with Laplace-LGP (the fourth plot of Figure 



10). 



5. Discussion 

We have presented approximate inference for logistic Gaussian process density estimation 
with Laplace's method. Empirical results with ID datasets indicate that the accuracy of 
the proposed approach is close to the slower state-of-the-art MCMC methods for density 
estimation. The logistic Gaussian process with Laplace's method also avoids the sampling 
and convergence assessment problems related to the highly multimodal posteriors of the 
mixture models. 
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Figure 10: An illustration of density regression with a simulated dataset. Dots represent 
a single realisation of n = 50 samples. The first plot shows the percentiles of 
the true conditional density, and the second and third plots show the estimates 
with the Laplace approximation (Laplace-DR) and with the MCMC sampling 
(MCMC-DR). The fourth plot shows the conditional density estimate computed 
from a 2D density estimate with Laplace-LGP. 



Density estimation with Laplace-LGP and 400 grid points takes about two seconds, 
which in interactive use is still reasonable waiting time. For a finer grid or more dimensions, 
more grid points could be placed, but this requires additional approximations. In this 
paper, we have considered a reduced-rank approximation for Laplace-LGP that avoids the 
infamous cubic scaling of the basic Gaussian process computation. In addition, we have 
demonstrated the suitability of the Laplace approach for estimating conditional densities 
with one predictor variable. 

Instead of Laplace's method, other deterministic approximations could be applied to 
speed up the posterior inference. For example, expectation propagation (EP) has been 
shown to perform better than the Laplace or variational approximations for binary classi- 



fication (Nickisch and Rasmussen, 2008) and for the Student-t regression (Jylanki et al. 



2011 ). However, for Poisson model EP was only slightly better than the Laplace approxima- 



tion (Vanhatalo et al. 2010). Because the Laplace approximation for the logistic Gaussian 
process performed almost as well as the MCMC approximation, we believe that EP could 
improve only slightly the performance of LA, but would increase the computation time. Fur- 
thermore, the implementation of expectation propagation for non-diagonal W is non-trivial, 
although quadrature-free moment matching could be considered for density estimation, in a 



similar way as was done for multiclass classification with GP priors (Riihimaki et al. 2012). 
However, in our preliminary testing, the moment matching turned out to be quite slow. 

The code for Laplace-LGP, MCMC-LGP and violin plot will be made available as 
part of the free GPstuff MATLAB toolbox ([httpT77becs.aalto.fi/en/researcri/bayes/ 
|gpstuff/p . 
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